Beyond Under-Alignment: Atomic Preference Enhanced Factuality Tuning for Large Language Models
Yuan, Hongbang, Chen, Yubo, Cao, Pengfei, Jin, Zhuoran, Liu, Kang, Zhao, Jun
Large language models (LLMs) have achieved remarkable success but still tend to generate factually erroneous responses, a phenomenon known as hallucination. A recent trend is to use preference learning to fine-tune models to align with factuality. However, existing work primarily evaluates fine-tuned models on in-domain (ID) datasets, and factuality on out-of-domain (OOD) datasets remains underexplored. In this paper, we conduct a comprehensive evaluation of the factuality of models tuned by various preference learning algorithms and demonstrate that their performance on OOD datasets either increases minimally or decreases. Subsequently, by analyzing the token distribution shift of the models before and after tuning, we reveal that the main cause of a model's failure to uphold factuality under a distribution shift is under-alignment, rather than over-alignment. Finally, we propose APEFT (Atomic Preference Enhanced Factuality Tuning), a framework that enhances a model's awareness of factuality at the granularity of individual facts. Extensive experiments demonstrate that APEFT improves model performance by an average of 3.45% on both ID and OOD datasets, showing that it is highly effective.
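The under- vs. over-alignment diagnosis above rests on comparing a model's token distributions before and after tuning. Below is a minimal sketch of such an analysis, not the authors' implementation; the checkpoint names, the prompt, and the KL-based shift metric are all illustrative assumptions.

```python
# Minimal sketch: per-token KL divergence between a tuned and a base model,
# in the spirit of the token-distribution-shift analysis described above.
# Checkpoint names, prompt, and the KL metric are illustrative assumptions,
# not the authors' setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"          # assumed base checkpoint
TUNED = "path/to/factuality-tuned-model"   # assumed tuned checkpoint

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED).eval()

@torch.no_grad()
def token_shift(prompt: str, response: str) -> torch.Tensor:
    """Per-token KL(tuned || base) over the response tokens.
    (Ignores tokenization effects at the prompt/response boundary.)"""
    ids = tok(prompt + response, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    logp_base = torch.log_softmax(base(ids).logits, dim=-1)
    logp_tuned = torch.log_softmax(tuned(ids).logits, dim=-1)
    kl = (logp_tuned.exp() * (logp_tuned - logp_base)).sum(-1).squeeze(0)
    # Position i predicts token i+1, so response tokens start at n_prompt-1.
    return kl[n_prompt - 1 : -1]

shifts = token_shift("Q: Where was Marie Curie born?\nA:", " Warsaw, Poland.")
print(shifts)  # near-zero shift on factual tokens would suggest under-alignment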
Long-form factuality in large language models
Wei, Jerry, Yang, Chengrun, Song, Xinying, Lu, Yifeng, Hu, Nathan, Huang, Jie, Tran, Dustin, Peng, Daiyi, Liu, Ruibo, Huang, Da, Du, Cosmo, Le, Quoc V.
Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process that sends search queries to Google Search and determines whether each fact is supported by the search results. Furthermore, we propose extending the F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user's preferred response length (recall). Empirically, we demonstrate that LLM agents can outperform crowdsourced human annotators: on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality. LongFact, SAFE, and all experimental code are available at https://github.com/google-deepmind/long-form-factuality.
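The extended-F1 aggregation is concrete enough to state directly. The sketch below follows the description in the abstract: precision is the fraction of supported facts among the provided facts, and recall is measured against K, the hyperparameter encoding a user's preferred number of facts. This is a reconstruction from the abstract; the reference implementation lives in the linked repository.

```python
# Sketch of the F1@K aggregation for long-form factuality, as described
# above. Reconstructed from the abstract; see the linked repository for
# the authors' reference implementation.
def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Balance precision (fraction of supported facts in the response)
    against recall relative to k, the preferred number of provided facts."""
    total = num_supported + num_not_supported
    if num_supported == 0 or total == 0:
        return 0.0
    precision = num_supported / total
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)

# A response with 40 supported and 10 unsupported facts, evaluated at K=64:
print(f1_at_k(40, 10, 64))  # precision 0.8, recall 0.625 -> F1@64 ~= 0.702
```

Note that recall saturates at 1.0 once a response provides at least K supported facts, so the metric stops rewarding ever-longer answers beyond the user's preferred length.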
Adapting Novelty to Classical Planning as Heuristic Search
Katz, Michael (IBM Watson Health) | Lipovetzky, Nir (University of Melbourne) | Moshkovich, Dany (IBM Watson Health) | Tuisov, Alexander (The Technion-Israel Institute of Technology)
The introduction of the concept of state novelty has advanced the state of the art in deterministic online planning in Atari-like problems, and in planning with rewards in general when rewards are defined on states. In classical planning, however, the success of novelty as a dichotomy between novel and non-novel states has been somewhat limited: until very recently, novelty-based methods could not successfully compete with state-of-the-art heuristic-search-based planners. In this work, we adapt the concept of novelty to heuristic search planning, defining the novelty of a state with respect to its heuristic estimate. We extend the dichotomy between novel and non-novel states by quantifying the novelty degree of state facts. We then present a variety of heuristics based on this notion of novelty and exploit the recently introduced best-first width search for satisficing classical planning. Finally, we show empirically that the resulting planners significantly improve the state of the art in satisficing planning.
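As a rough illustration of novelty relative to a heuristic estimate, the sketch below counts, for a newly generated state, the facts that have not appeared in any previously evaluated state with an equal or lower heuristic value. This is one plausible reading of the definition summarized above; the bookkeeping structures, fact names, and exact novelty condition are assumptions, not the planners' actual implementation.

```python
# Sketch: novelty degree of a state with respect to a heuristic estimate,
# under one plausible reading of the definition above. Data structures and
# the exact novelty condition are illustrative assumptions.
from typing import Dict, FrozenSet

Fact = str
# Lowest heuristic value at which each fact has been seen so far.
best_h: Dict[Fact, float] = {}

def novelty_degree(state: FrozenSet[Fact], h: float) -> int:
    """Number of facts in `state` unseen in any earlier state with h' <= h.
    Also records h for every fact of the state (side effect)."""
    novel = sum(1 for f in state if best_h.get(f, float("inf")) > h)
    for f in state:
        best_h[f] = min(best_h.get(f, float("inf")), h)
    return novel

# A best-first width search could then order its open list by, e.g.,
# (-novelty, h), preferring states that reveal more novel facts.
s1 = frozenset({"at(robot,A)", "hand-empty"})
s2 = frozenset({"at(robot,B)", "hand-empty"})
print(novelty_degree(s1, h=5.0))  # 2 -- both facts are new
print(novelty_degree(s2, h=4.0))  # 2 -- at(robot,B) is new, and hand-empty
                                  # was previously seen only at a higher h
```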